In this document, we analyse missing data patterns in the dataset. The dataset has been obtained as described in the document “Download subsample of pollution data” (ie using europollution package and downloading data for a random sample of 10% of the stations, from 2013 to 2018). In a first step, we focus on one pollutant only, PM10. The present document is build so that it can be easily modified and recomputed for other pollutants or another sample.
In order to assess by guesswork the representativity of the subsample drawn, we look at the location of the station selected:
map_raster <- map_data("world")
wrangled_metadata %>%
filter(longitude < 47 & longitude > -25 & latitude > 35 & latitude < 71) %>%
ggplot() +
geom_point(aes(x = longitude, y = latitude, color = in_subsample), size = .05) +
# geom_polygon(data = map_raster,
# aes(x = long, y = lat, group=group),
# color = "grey", fill = NA) +
coord_map(projection = "azequidistant") +
scale_color_brewer(palette = "Dark2") +
labs(title = "Location of stations present in the random subsample considered", x = "", y = "", color = "Station in sample") The sample seems to cover most of Europe and should therefore be somehow representative. Note that the previous map does not represent the full set of stations. For readability reasons, we considered a zoomed version of the map. For more details, one can refer to the following interactive map.
pal <- colorFactor("Dark2", domain = wrangled_metadata$in_subsample)
color_in_subsample <- pal(wrangled_metadata$in_subsample)
content <- paste("Station type:", wrangled_metadata$station_type, "<br/>",
"Station area:", wrangled_metadata$station_area, "<br/>")
wrangled_metadata %>%
select(station_id, longitude,latitude, in_subsample, station_area, station_type) %>%
distinct() %>%
leaflet() %>%
addTiles() %>%
addProviderTiles("Esri.WorldGrayCanvas") %>%
addCircles(col = color_in_subsample, popup = content, lng = ~longitude, lat = ~latitude) %>%
addLegend(pal = pal, values = ~wrangled_metadata$in_subsample , title = "Station in sample")We also consider the number of stations per country:
nb_station_country <- wrangled_metadata %>%
group_by(country_iso) %>%
summarise(
nb_station_country = n(),
nb_station_country_subsample = sum(in_subsample),
prop_in_country_whole = n()/nrow(wrangled_metadata),
prop_in_country_subsample = sum(in_subsample)/sum(wrangled_metadata$in_subsample)
)
nb_station_country %>%
kable(col.names = c("Country ISO", "Number of stations in country", "Number of stations in subsample in country", "Proportion of stations located in this country", "Proportion of stations in the subsample located in this country"))| Country ISO | Number of stations in country | Number of stations in subsample in country | Proportion of stations located in this country | Proportion of stations in the subsample located in this country |
|---|---|---|---|---|
| AD | 9 | 8 | 0.0002019 | 0.0019222 |
| AL | 53 | 0 | 0.0011892 | 0.0000000 |
| AT | 2025 | 235 | 0.0454382 | 0.0564632 |
| BA | 105 | 12 | 0.0023561 | 0.0028832 |
| BE | 1821 | 172 | 0.0408607 | 0.0413263 |
| BG | 299 | 38 | 0.0067092 | 0.0091302 |
| CH | 307 | 25 | 0.0068887 | 0.0060067 |
| CY | 90 | 6 | 0.0020195 | 0.0014416 |
| CZ | 1867 | 247 | 0.0418929 | 0.0593465 |
| DE | 5475 | 454 | 0.1228515 | 0.1090822 |
| DK | 207 | 5 | 0.0046448 | 0.0012013 |
| EE | 112 | 47 | 0.0025131 | 0.0112926 |
| ES | 6143 | 467 | 0.1378405 | 0.1122057 |
| FI | 422 | 16 | 0.0094691 | 0.0038443 |
| FR | 5926 | 658 | 0.1329713 | 0.1580971 |
| GB | 2402 | 261 | 0.0538976 | 0.0627102 |
| GE | 56 | 0 | 0.0012566 | 0.0000000 |
| GI | 29 | 0 | 0.0006507 | 0.0000000 |
| GR | 288 | 29 | 0.0064623 | 0.0069678 |
| HR | 159 | 0 | 0.0035677 | 0.0000000 |
| HU | 281 | 39 | 0.0063053 | 0.0093705 |
| IE | 320 | 23 | 0.0071804 | 0.0055262 |
| IS | 132 | 26 | 0.0029619 | 0.0062470 |
| IT | 4974 | 483 | 0.1116097 | 0.1160500 |
| LT | 187 | 0 | 0.0041960 | 0.0000000 |
| LU | 79 | 6 | 0.0017727 | 0.0014416 |
| LV | 129 | 31 | 0.0028946 | 0.0074483 |
| ME | 26 | 6 | 0.0005834 | 0.0014416 |
| MK | 138 | 9 | 0.0030965 | 0.0021624 |
| MT | 112 | 0 | 0.0025131 | 0.0000000 |
| NL | 742 | 56 | 0.0166495 | 0.0134551 |
| NO | 380 | 40 | 0.0085267 | 0.0096108 |
| PL | 2633 | 241 | 0.0590809 | 0.0579049 |
| PT | 553 | 37 | 0.0124086 | 0.0088900 |
| RO | 1312 | 121 | 0.0294395 | 0.0290726 |
| RS | 154 | 0 | 0.0034555 | 0.0000000 |
| SE | 1556 | 61 | 0.0349145 | 0.0146564 |
| SI | 206 | 0 | 0.0046224 | 0.0000000 |
| SK | 227 | 58 | 0.0050936 | 0.0139356 |
| TR | 2579 | 245 | 0.0578692 | 0.0588659 |
| XK | 51 | 0 | 0.0011444 | 0.0000000 |
nb_station_country %>%
pivot_longer(starts_with("nb_station"), names_to = "in_sample", values_to = "number") %>%
mutate(in_sample = str_sub(in_sample, 12, nchar(in_sample))) %>%
ggplot() +
geom_point(aes(x = country_iso, y = number, color = in_sample)) +
scale_color_brewer(palette = "Dark2") +
labs(title = "Number of stations by country in overall sample and subsample", x = "Country", y = "Number of stations", color = "Number in") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))nb_station_country %>%
pivot_longer(starts_with("prop_in"), names_to = "in_sample", values_to = "prop") %>%
mutate(in_sample = str_sub(in_sample, 17, nchar(in_sample))) %>%
ggplot() +
geom_point(aes(x = country_iso, y = prop, color = in_sample)) +
scale_color_brewer(palette = "Dark2") +
labs(title = "Proportion of stations by country in overall sample and subsample", x = "Country", y = "Proportion of stations", color = "Proportion in") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))Here, we investigate whether missing PM10 concentration data varies across different dimentions.
# To do so, we will use the following function
hist_missing_by <- function(df, variable_missing, grouping_var, facet_var = NA) {
graph <- df %>%
group_by_(grouping_var, facet_var) %>%
summarise(share_missing = sum(is.na(.data[[variable_missing]]))/n()) %>%
arrange(share_missing) %>%
ggplot() +
# geom_point(aes(x = grouping_var, y = share_missing), color = "darkBlue") +
geom_col(aes(x = .data[[grouping_var]], y = share_missing), fill = "darkBlue") +
labs(title = paste("Share of missing", variable_missing,"data by", grouping_var), x = grouping_var, y = paste("Share of missing", variable_missing,"data")) +
ylim(0, 1) +
facet_wrap(facets = facet_var)
# graph <- df %>%
# mutate(missing = is.na(concentration)) %>%
# ggplot() +
# geom_bar(aes(x = .data[[grouping_var]], fill = missing), position = "fill") +
# facet_wrap(facet_var) +
# labs(title = paste("Share of missing", variable_missing,"data by", grouping_var), x = grouping_var, y = paste("Share of missing", variable_missing,"data")) +
# scale_fill_brewer(palette = "Blues")
return(graph)
}The overall share of missing value is 0.3781148.
We can break this down by countries.
missing_share_country <- data %>%
group_by(country_iso) %>%
summarise(share_missing = sum(is.na(concentration))/n()) %>%
arrange(share_missing)
missing_share_country %>%
kable()| country_iso | share_missing |
|---|---|
| AT | 0.0260500 |
| MK | 0.0313078 |
| AD | 0.0324936 |
| BE | 0.0376590 |
| GB | 0.0652766 |
| FR | 0.0693536 |
| NL | 0.0797792 |
| NO | 0.0924395 |
| TR | 0.1005766 |
| LU | 0.1286677 |
| FI | 0.1376533 |
| DE | 0.1531009 |
| SK | 0.1845098 |
| PT | 0.1921190 |
| BA | 0.2180321 |
| CZ | 0.3097139 |
| SE | 0.3207205 |
| IS | 0.4095204 |
| ES | 0.4204464 |
| PL | 0.4981073 |
| GR | 0.5257783 |
| IT | 0.7241246 |
| EE | 0.9399007 |
| HU | 0.9599505 |
| ME | 0.9601598 |
| BG | 0.9615132 |
| LV | 0.9619673 |
| IE | 0.9655351 |
| CH | 0.9662758 |
| DK | 0.9670484 |
| RO | 0.9713371 |
missing_share_country %>%
ggplot() +
# geom_point(aes(x = reorder(country_iso, share_missing), y = share_missing), color = "darkBlue") +
geom_col(aes(x = reorder(country_iso, share_missing), y = share_missing), fill = "darkBlue") +
labs(title = "Share of missing PM10 concentration data by country", x = "Country", y = "Share of missing PM10 concentration data") +
coord_flip() One can notice that the share of missing values varies drastically across countries. This can be due to the quality of monitoring of air pollution stations. This variation might also come from our data wrangling. We need to check that.
Overall, we notice that, along the time dimension data almost seems to be missing at random appart from a decreasing trend in the proportion of missing data
data %>%
mutate(year = factor(year(date))) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "year")One may notice that the share of missing data slightly decreased over time. This might be due to improvement in air measurement capacity.
We look at whether this variation comes from some countries more than others.
data %>%
mutate(year = year(date)) %>%
group_by(country_iso, year) %>%
summarise(share_missing = sum(is.na(concentration))/n()) %>%
arrange(share_missing) %>%
ggplot() +
geom_point(aes(x = country_iso, y = share_missing, color = factor(year))) +
scale_color_brewer(palette = "Spectral") +
labs(title = "Share of missing PM10 concentration data by country", x = "Country", y = "Share of missing PM10 concentration data") +
ylim(0, 1)Zooming in and looking at monthly data, we can see that this decreasing trend is somehow step wise, with some important decreases between December and January.
data %>%
mutate(year_month = paste(year(date), str_pad(month(date), 2, pad = 0))) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "year_month") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))We now investigate whether the share of missing data varies across month of the year, day of the month and day of the week. There does not seem to be huge variations.
data %>%
mutate(month = factor(month(date))) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "month")data %>%
mutate(
year = factor(year(date)),
month = factor(month(date))
) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "month", facet_var = "year")data %>%
mutate(day_of_month = factor(day(date))) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "day_of_month")data %>%
mutate(
year = factor(year(date)),
day_of_month = factor(day(date))
) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "day_of_month", facet_var = "year")data %>%
mutate(
month = factor(month(date)),
day_of_month = factor(day(date))
) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "day_of_month", facet_var = "month")data %>%
mutate(day_in_week = factor(wday(date))) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "day_in_week")data %>%
mutate(
year = factor(year(date)),
day_in_week = factor(wday(date))
) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "day_in_week", facet_var = "year")data %>%
mutate(
month = factor(month(date)),
day_in_week = factor(wday(date))
) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "day_in_week", facet_var = "month")Now we investigate whether there are some variation in missingness patterns across time.
data %>%
mutate(hour = factor(hour(date))) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "hour")One may notice that there is a lot less missing data around 11pm. This might be problematic and needs to be investigated.
data %>%
mutate(hour = factor(hour(date))) %>%
hist_missing_by(variable_missing = "concentration", grouping_var = "hour", facet_var = "country_iso")This pattern seems to be widespread across countries, even though some countries do not seem to be affected.